
WO 97/04400: 

The present invention relates to a pattern recognition system that uses an artificial neural network 
architecture with novel mechanisms for object and 2-D pattern (signal) recognition in cluttered
[...]

JP 11 039313:

[...] eliminating unsuitable characteristic from learning data based on calculated relation degree of
[...]

JP 11 085797:

Automatic document-classifying apparatus for e.g. hard disc - has classification decision unit that 
compares document vector derived for classifying document and folder vector to decide category 
in which classifying document belongs 

[...] ARTMAP system, extraction means and so forth.

Industrial applicability is given.

(57) Abstract: Document classification apparatus is disclosed comprising feature extraction means
(20) for extracting a plurality of features from a document and a classifier (30) operable to
process the document in a knowledge acquisition mode in which information for classification is
acquired from the features of the document and added incrementally to a knowledge base (105) and
in a document classification mode in which the classifier (30), using the knowledge base,
determines from the features a predicted classification for the document, the classifier (30)
being switchable between the modes under user control. The apparatus is further operable to
perform rule insertion in the knowledge acquisition mode in which a number of features are input
by a user to the classifier together with a classification with which the features are associated.



WO 01/14992 






PCT/SG99/00089 



DOCUMENT CLASSIFICATION APPARATUS

BACKGROUND AND FIELD OF THE INVENTION

This invention relates to apparatus for classifying documents.

Traditionally, documents which arrive at a central location, for example a post room or
facsimile machine, and which need to be distributed to a certain destination are sorted and
delivered by hand. Efforts have been made to automate this process. For example, it has
been proposed in US 5,461,488 to provide apparatus which identifies the destination of a
facsimile by applying document image analysis and recognition techniques to the
facsimile. US 5,461,488 provides routing based on identification of the recipient's name.
However, for many faxes received, for example in information gathering or public service
industries, information identifying a specified recipient may not be present, so that such
faxes would not be routed automatically.

General text classifying systems which classify documents into one or more categories
have been proposed in US 5,371,807 and US 5,675,710. Such systems use only a single
classification strategy, either profile-based, having a keyword/character profile for each
category, or rule-based, in which category knowledge is represented in the form of rules.
The systems also use only a single knowledge acquisition strategy, either statistically
learned knowledge or user-specified knowledge, to provide the knowledge base with which
text from a document to be classified is compared to provide the document classification.

It is a disadvantage of the prior art systems noted above that they are prone to
mis-classification and consequent mis-routing of documents, as well as cumbersome operation.

It is an object of the invention to provide an improved document classification apparatus.

SUMMARY OF THE INVENTION

According to the invention, there is provided document classification apparatus
comprising feature extraction means for extracting a plurality of features from a document
and a classifier operable on the extracted features to process the document in a knowledge
acquisition mode in which the association of a classification with the document is added
incrementally to a knowledge base or in a document classification mode in which the
classifier, using the knowledge base, determines a predicted classification for the
document, the classifier being switchable between the modes under user control.

The features are preferably formed into a feature vector for input to the classifier and the
features preferably comprise classification-associated words or phrases which may appear
in the document. The extracting means may be arranged to provide a measure of the
frequency of occurrence of the features in the document.

The classifier may comprise a supervised ART system, preferably an ARAM system of
the type disclosed in "Adaptive Resonance Associative Map", an article by one of the
present inventors, Ah-Hwee Tan, published in "Neural Networks", Vol 8 No 3 pp 437-446,
1995, or an ARTMAP system of the type disclosed in US 5,214,715.

The apparatus may further be operable in knowledge acquisition mode to process a
plurality of training documents with associated classifications as a batch.

The apparatus may further be operable in a rule insertion sub-mode of the knowledge
acquisition mode in which a plurality of features are input by a user to the classifier
together with a classification with which the features are associated.

The apparatus may further comprise a router arranged to route the document to one of a
plurality of destinations in dependence upon the classification. The classification may
have associated therewith a confidence value comparable to a threshold, the router being
arranged to make an automatic routing or manual routing decision in dependence upon the
comparison, with a said destination being a system administrator, responsible for manual
routing.

The described embodiment provides a document classification apparatus which allows
learning to be performed in an incremental way by allowing a system administrator to
correct document classification mistakes as they occur, the apparatus learning from these
mistakes. Incremental learning of new cases does not require re-learning of previous
cases, thus eliminating the need to preserve past cases for re-learning. While the described
embodiment focuses primarily on incremental learning, the apparatus is further able to
perform learning of a plurality of cases as a batch. During batch learning, the apparatus
learns each case one by one and accumulates the classification information into the
knowledge base. Besides learning from training data, the apparatus also allows rules to be
inserted into the learning process, leading to a more flexible learning environment.

The apparatus is furthermore able to determine a confidence that the classification of a
particular document is correct, in the form of a confidence value. This confidence value is
compared to a threshold parameter to decide if automatic or manual routing is desirable.
Adjusting this threshold parameter to match a desired confidence value controls the
proportions of manual and automatic routing, thus allowing a smooth transition from a state
where manual routing is favoured to one, as the classifier becomes more accurate, that
favours automatic routing.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will now be described by way of example with reference to
the accompanying drawings in which:

Figure 1 is a schematic diagram illustrating the structure of the described embodiments of
the invention;

Figure 2 is a diagram illustrating the document classifier of Figure 1 in a document
classification mode;

Figure 3 is a diagram illustrating the modes of operation of the embodiments of the
invention;

Figure 4 is a diagram of an ARAM system used as a document classifier in an
embodiment of the invention;

Figures 5, 6 and 7 summarize the parameter setting and the relevant functional blocks of
the document classification system in the learning, rule insertion, and document
classification modes respectively.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

With reference to Figures 1-3, a document classification apparatus is shown. The
apparatus is operable in a knowledge acquisition mode and a document classification
mode. In knowledge acquisition mode, the apparatus learns from training documents and
rules to recognise categories based on document content. This knowledge is then applied
in document classification mode to classify further documents. The structure of the
apparatus is shown in Figures 1 and 2 and will now be described with reference to the
document classification mode. The structure in knowledge acquisition mode is the same,
but is used differently as described with reference to Figure 3.

A document text file, for example a text file derived from a scanned and OCR processed
physical document, or derived from a received and stored facsimile message which has
been analysed for and converted to textual content, or a word processor document file, is
fed to a document classifier 10. The document classifier includes a feature extraction
module 20 which analyses the text file and extracts previously selected features in the
form of keywords or phrases from that file. These are fed as a feature vector to a classifier
30, which is in the form of an ARAM (Adaptive Resonance Associative Map) system and
which provides a predicted classification for the document in response to the input
feature vector. This classification is associated with a confidence value which, together
with the document, is passed to a router 40. At the router 40, the value is compared to a
threshold input by a system administrator 50. If the value exceeds the threshold, the
document is routed to the destination 52 specified by the classification, via path 55. If
not, the document is routed to the system administrator 50 for manual routing via path 60.
The destinations 52 can also communicate with the system administrator 50 through path
60, to return misdirected documents for manual routing.
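
The routing decision made at the router 40 can be sketched in Python as follows. This is a minimal illustration only: the names `Prediction`, `route` and `ADMIN` are hypothetical and do not appear in the specification, and the sketch assumes the confidence value and the threshold are plain numbers on the same scale.

```python
from dataclasses import dataclass

# Hypothetical label standing in for the system administrator 50 (manual routing).
ADMIN = "system_administrator"


@dataclass
class Prediction:
    category: str      # predicted classification, used here as the destination name
    confidence: float  # confidence value associated with the classification


def route(prediction: Prediction, threshold: float) -> str:
    """Return the destination for a classified document.

    The document is routed automatically only if the confidence value
    exceeds the threshold; otherwise it goes to manual routing.
    """
    if prediction.confidence > threshold:
        return prediction.category
    return ADMIN
```

Raising the threshold sends more documents to manual routing; lowering it, as the classifier becomes more accurate, favours automatic routing.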

The modes of operation of the apparatus are shown in Figure 3. In knowledge
acquisition mode, two sub-modes are used. The first, represented by block 100, is based
on learning and requires the input of training data in the form of documents, for each of
which a feature vector is extracted by module 20 and fed to module 30. The training
documents can either be input individually or as a batch. The actual category of the
document is input by system administrator 50 and fed to the module 30. Module 30 then
adjusts (if necessary) the association of the vector to the predicted category in a knowledge
base 105 so that the predicted category equals the actual category. The second sub-mode
is based on rule insertion, represented by block 110. In rule insertion, a feature vector
and an actual category are input by the system administrator 50 and an association
between the input vector and the actual category is made, if one does not already exist.

In document classification mode, represented by block 120, the feature vector and
document are fed to the module 30 and, based on the knowledge acquired by the
knowledge base in the knowledge acquisition mode, a classification is determined in
accordance with the feature vector and the classification is output together with the
document.

The system administrator can access the document classifier directly via path 70 to
allow switching between the knowledge acquisition sub-modes and the document
classification mode. Such switching may be used, for example, if a mis-directed
document has been returned to the system administrator. The system administrator may
then cause the document classifier to enter the learning sub-mode of knowledge acquisition
mode, the system administrator inputting the correct classification for the document to the
classifier 30 together with the document to the feature extraction module 20, from which
the features are extracted and passed to the classifier 30, so that the mis-directed document
and associated correct classification are added to the knowledge base.

Similarly, at any point in operation of the document classification apparatus, the system
administrator can add additional training documents and/or rules by switching from
document classification mode to knowledge acquisition mode.

The highlighted processes will now be explained in more detail.

Feature Selection

For document classification, there is a need to represent text documents in some format-
and language-independent form, commonly termed a feature representation, before
processing by a classifier. One of the most common forms of representation of features is
that of singular word tokens. Specifically, the tokens are individual words that have been
extracted from each document and transformed to their root form (e.g. the root form of
"selection" is "select"). Other "filtering" options based on sentence structure, such as
recognizing only nouns while ignoring other word types such as prepositions and
conjunctions, can also be used as will be apparent to those skilled in the art, in
dependence upon requirements.

The keyword-based feature sets can be pre-defined manually or generated automatically
from a set of pre-labeled documents.






The algorithm for automatic keyword selection accepts a list of pre-classified (i.e.
training) text documents which are analyzed, processing one document at a time.
Processing involves the extraction of all nouns (in root form) from each document and
recording the number of occurrences of each of these prospective keywords within each
category as well as within each document. Based on a certain set of selection rules, an
overall rating of the "quality" of each word as a keyword is calculated and the list of
keywords sorted by this value. The N keywords with the highest rating are then
selected as the "feature space" to be used for representing all documents (training or
otherwise). The algorithm uses four different selection rules in ranking keywords which
are combined to form a selection rating ($f_{rating}$). These are:

(a) Class Entropy

(b) Document Entropy

(c) Relative Keyword Count

(d) Document Inclusion Rate

a) Class Entropy ($f_{CE}$): this measures the distribution of a keyword's occurrence across
the different categories. The more "polarized" the keyword's occurrence is towards a
particular category, the more significant the keyword will be. This is because a keyword
which occurs almost solely within one category and not at all in the others is much more
likely to have some non-trivial association with that category, as compared with a
keyword which has a more even distribution across the categories.

The formula used to calculate class entropy for C different categories is:

[equation not reproduced in the source text]

where:

Count(x) = Total number of occurrences of keyword in category x

b) Document Entropy ($f_{DE}$): this measures the distribution of a keyword's occurrence
across the different documents in a particular category. The criterion for a good keyword
here is the opposite of that for Class Entropy. Here, a keyword which is more
evenly distributed across the documents in one category is a much better feature than one
that has a more "polarized" distribution. This is because a keyword that occurs in more
documents within a category is more likely to be one commonly associated with
documents of that category.

The formula used to calculate document entropy for D documents within one category is:

[equation not reproduced in the source text]

where:

Count(x) = Total number of occurrences of keyword in document x

c) Relative Keyword Count ($f_{RKC}$): for a particular keyword, the top 2 document categories
are defined as the 2 categories with the highest absolute count for that keyword. The
keyword-per-document ratio ($f_{Ratio_i}$) for a category, i, is the total keyword count ($C_i$)
for the category divided by the total number of documents ($D_i$) in that category. This
relation can be expressed simply as:

$f_{Ratio_i} = C_i / D_i$

Relative Keyword Count thus gives an indication of the difference between the
keyword-per-document ratio of the 1st ($f_{Ratio_1}$) and 2nd ($f_{Ratio_2}$) categories. A keyword
with a large difference between $f_{Ratio_1}$ and $f_{Ratio_2}$ is better than one with a small
difference.

A measurement of $f_{RKC}$ for C different categories is given by:

$f_{RKC} = (f_{Ratio_1} - f_{Ratio_2}) / f_{Ratio_1}$

d) Document Inclusion Rate ($f_{DIR}$): $f_{RKC}$ can be skewed by the high number of occurrences
of a keyword in just one or two documents of a category. The use of $f_{DIR}$ helps to "balance
out" such situations by considering the number of documents in the top category in which
the keyword occurs at least once.

A measurement of $f_{DIR}$ for $D_{1st}$ documents in the top category is given by:

$f_{DIR} = O_{1st} / D_{1st}$

where:

$O_{1st}$ = number of documents in top category in which keyword occurs.

The overall ranking of each keyword is therefore simply derived by taking:

$f_{Ranking} = f_{CE} \times f_{DE} \times f_{RKC} \times f_{DIR}$

with:

$0.0 \le f_{Ranking} \le 1.0$

In this case, equal weighting has been given to each factor. Different coefficients could
easily be added to each factor to give it a larger or smaller weighting.
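
The two count-based factors and the final product can be sketched in Python as below. This is an illustrative sketch only: the function and argument names are hypothetical, at least two categories are assumed, and the class entropy and document entropy factors are taken as already-computed inputs rather than derived here.

```python
def relative_keyword_count(counts, docs_per_category):
    """(ratio of top category - ratio of second category) / ratio of top category.

    counts[i]            -- total occurrences of the keyword in category i
    docs_per_category[i] -- total number of documents in category i
    The top two categories are chosen by absolute keyword count.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    first, second = order[0], order[1]
    ratio_1 = counts[first] / docs_per_category[first]
    ratio_2 = counts[second] / docs_per_category[second]
    if ratio_1 == 0:
        return 0.0  # keyword never occurs; treated as uninformative (assumption)
    return (ratio_1 - ratio_2) / ratio_1


def document_inclusion_rate(counts, docs_with_keyword, docs_per_category):
    """Fraction of documents in the top category that contain the keyword."""
    top = max(range(len(counts)), key=lambda i: counts[i])
    return docs_with_keyword[top] / docs_per_category[top]


def overall_ranking(f_ce, f_de, f_rkc, f_dir):
    """Equal-weight combination of the four factors by multiplication."""
    return f_ce * f_de * f_rkc * f_dir
```

For instance, a keyword occurring 92 times in one category and 6 times in the other, with 124 documents per category, gives a relative keyword count of 86/92, roughly 0.93; if it appears in 28 of the 124 documents of the top category, the document inclusion rate is 28/124, roughly 0.23.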

The following example uses a small training set of two categories with 124 relevant
documents each. The categories are business newspaper articles in the first category and
non-business (e.g. sports) articles in the other. Consider a sampling of 40 keywords taken
from the set of all keywords selected from the training sets. The total count of each
keyword within each category, as well as the number of documents (per category) in which
it occurs, is as shown in Table 1. In Table 2, the "paths" (as shown by arrowed lines) of
two exemplary keywords as they are ranked according to the four different factors,
together with the final rating, are shown, with the combination of the four factors helping to
provide a better overall view of the relative suitability of each keyword.

TABLE 1

                 Total count        Unit document count
Keyword        Cat 1    Cat 2        Cat 1    Cat 2
annual            33        2           22        2
authority         27        2           23        2
bank              92        6           28        4
capital           61        1           29        1
cent             560       32           87       21
champion           0       60            0       31
coach              0       26            0       19
company          191       12           65        8
corporate         48        4           26        3
cup                0       54            0       20
distribute        17        0           15        0
economy           87        1           28        1
event              2       34            1       25
exchange          34        2           23        2
fan                1       36            1       18
final             12       98           10       45
game               1       66            1       32
industry         137        6           59        6
invest            99        2           46        2
mainboard         26        0           22        0
market           175        4           64        3
match              6       65            3       28
minister          41        3           23        3
pe               499       43           91       21
play              12      181           11       55
potential         31        0           23        0
profit            78        5           28        3
property          96        3           37        3
rate             113        2           34        2
round              0       60            0       28
score              0       30            0       17
share            217        8           47        8
star               0       49            0       23
stock            104        0           40        0
technology        50        0           26        0
tournament         0       34            0       19
venture           53        1           20        1
victory            0       30            0       20
win                9      149            5       60
woman              0       50            0       22

The algorithm allows for the specification of a minimum number K of non-zero keyword
counts which are expected to be found within each training document. The training
documents are pre-processed by the method described above to determine the number of
non-zero keyword counts in each document. Whenever a training document is found to
have too few non-zero keyword counts, the next highest ranked keywords within the
document are added to the set of N keywords initially selected, to bring the number of
non-zero keyword counts for that document up to K. The total number of unique "bonus"
keywords B extracted from all training documents thus increases the dimension of the
feature space to N+B.
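
One possible reading of this "bonus" keyword step is sketched below in Python. The function name, argument layout and data structures are hypothetical; in particular, the sketch assumes that a bonus keyword added for one document also counts towards the non-zero total of later documents, which the text does not state explicitly.

```python
def add_bonus_keywords(selected, ranked_keywords, doc_counts, k):
    """Extend the N selected keywords so each document has up to k non-zero counts.

    selected        -- the N keywords initially chosen
    ranked_keywords -- all candidate keywords, best rating first
    doc_counts      -- one dict per training document: keyword -> occurrence count
    k               -- minimum number of non-zero keyword counts wanted per document
    Returns the enlarged feature space (N + B keywords).
    """
    feature_space = list(selected)
    chosen = set(selected)
    for counts in doc_counts:
        nonzero = sum(1 for w in chosen if counts.get(w, 0) > 0)
        for w in ranked_keywords:
            if nonzero >= k:
                break
            # Take the next highest ranked keyword that occurs in this document.
            if w not in chosen and counts.get(w, 0) > 0:
                chosen.add(w)
                feature_space.append(w)
                nonzero += 1
    return feature_space
```

A document that contains fewer than K candidate keywords in total simply contributes all the candidates it has.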

Keyword Extraction

Once the keywords have been selected in the manner described above, keywords are
extracted from a document and are formed into a feature vector, using the N+B set of
keywords obtained during the selection process as the limited set of significant keywords
that are to be searched for within the document. This procedure is applied both to training
documents, to produce a set of respective training feature vectors, and to new documents,
to produce respective feature vectors for yet-to-be categorized documents.

Based on the selected keyword features, the feature extraction algorithm parses the
document to record the number of times ($c_i$) each keyword $w_i$ appears in the document. The
keyword counts are then normalized such that the maximum score is 1 and the minimum
score is 0. These scores are then provided as input to the classifier 30 as a normalized
keyword frequency count feature vector which encodes the statistical distribution of the
keywords in the documents and thus provides a rough representation of the document
content.
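
A minimal Python sketch of this extraction step follows. It assumes that normalization means dividing every count by the largest count in the document, which is one reading of "maximum score is 1 and minimum score is 0"; the function name is hypothetical and the tokens are assumed to be already reduced to root form.

```python
def keyword_frequency_vector(tokens, keywords):
    """Count each keyword in a tokenized document and scale counts to [0, 1].

    tokens   -- list of root-form words from the document
    keywords -- the ordered feature space (the N+B selected keywords)
    """
    counts = [tokens.count(w) for w in keywords]
    peak = max(counts, default=0)
    if peak == 0:
        # No keyword occurs at all: return an all-zero vector.
        return [0.0] * len(keywords)
    return [c / peak for c in counts]
```

With counts of 0, 3, 2, 3 and 1 for five keywords, the vector is 0, 1, 2/3, 1 and 1/3, which rounds to 0.0, 1.0, 0.7, 1.0 and 0.3.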

The feature extraction process using two sample articles is illustrated below. The first
article, for Category 1 (business section), produces a positive word count for certain
predominantly business-related keywords which are converted to relative frequency values
as shown to form the input vector. The second article, for Category 2 (sports, music and
life section), produces a positive word count for certain predominantly sports-related
keywords which are likewise converted to relative frequency values to form the input
vector.

Sample article for Category 1 (business section)

JUN 30 1997 Stationery maker Nippecraft in the red

MAINBOARD-LISTED specialist stationery maker Nippecraft
has reported losses of $12 million for the year ended March, but
said a company reorganisation would improve its bottomline this
year.

The losses came on the back of a 4 per cent drop in turnover to
$64 million and include exceptional and ordinary charges totalling
close to $11 million, according to the company's unaudited
results.

There will be no dividend payouts this year. Net tangible asset
backing per ordinary share dropped to 0.69 cent, from 7.82 cents
last year. The results were in line with Nippecraft's projections
announced in February.

Managing director Bill Habergham attributed a large part of the
loss to reorganisation of businesses in Britain, the United States,
Australia and Malaysia.

"The exercise has now largely been completed and
notwithstanding the tougher prospects ahead, we expect to reap
the benefits of the reorganisation and restructure exercise in the
current financial year," a company statement quoted him as
saying.

Nippecraft said the exceptional charges, amounting to $6.8
million, included the writing down of stock by $5 million,
operating losses and costs associated with the closure or
restructuring of subsidiaries which cost close to $2 million.

The group managed to reduce inventory levels by a third to $18
million. Mr Habergham said this would benefit the group in the
long term.

Keyword Table for Sample Article for Category 1

Keyword       Count   Relative Frequency
market            0   0.0
cent              3   1.0
pe                2   0.7
industry          0   0.0
company           3   1.0
invest            0   0.0
develop           0   0.0
stock             1   0.3
share             1   0.3
list              1   0.3
property          0   0.0
technology        0   0.0
capital           0   0.0
economy           0   0.0
sector            0   0.0
billion           0   0.0
potential         0   0.0
mainboard         1   0.3
project           1   0.3
play              0   0.0
win               0   0.0
champion          0   0.0
game              0   0.0
rate              0   0.0
round             0   0.0
star              0   0.0
final             0   0.0
woman             0   0.0
victory           0   0.0
tournament        0   0.0

Sample article for Category 2 (sports, music & life section)



JUL 2 1997 Love-fit Testud clinches famous win 

LONDON -- One point short of a famous victory, Sandrine 
Testud rolled her eyes to the leaden skies then across the net to
Monica Seles, shifting nervously from foot to foot. 

Finally she turned to her Italian boyfriend Vittorio, huddled 
among the spectators overlooking the No. 3 court. 

On his signal she served straight, deep and fast for her sixth ace 
and an astonishing 0-6, 6-4, 8-6 win over the Wimbledon second 
seed. 

Nothing had looked less likely half-an-hour into Monday's 
third-round match. 

On a court of low and uncertain bounce after the heavy rain 
which ravaged the opening week, Seles breezed through the first 
set in exactly 30 minutes. 

"I was just trying to start playing," said the unseeded Testud. 

"I was just so slow and nothing was going right." 

The 25-year-old Frenchwoman, who lives and trains in Rome, 
finally won a game, holding serve in the second game of the 
second set after dropping the first two points. 

She broke Seles in the next game. Hitting longer and with more 
power as she gained in confidence, Testud held service to win the 
second set in 42 minutes. 

"I got a little bit tight, missed a couple of shots and the set was 
gone," Seles reflected. 

Vittorio's contribution apparently goes further than his court-side 
advice. 

How have you gotten so fit?, Testud was asked. 

Keyword Table for Sample Article for Category 2

Keyword       Count   Relative Frequency
market            0   0.0
cent              0   0.0
pe                0   0.0
industry          0   0.0
company           0   0.0
invest            0   0.0
develop           0   0.0
stock             0   0.0
share             0   0.0
list              0   0.0
property          0   0.0
technology        0   0.0
capital           0   0.0
economy           0   0.0
sector            0   0.0
billion           0   0.0
potential         0   0.0
mainboard         0   0.0
project           0   0.0
play              0   0.0
win               4   1.0
champion          0   0.0
game              3   0.8
rate              0   0.0
round             1   0.2
star              0   0.0
final             2   0.5
woman             0   0.0
victory           1   0.2
tournament        0   0.0

The Classifier: Adaptive Resonance Associative Map (ARAM)

ARAM is a family of neural network models that performs incremental supervised
learning of recognition categories (pattern classes) and multidimensional maps of both
binary and analog patterns. An ARAM system is shown in Figure 4 and can be visualized
as two overlapping Adaptive Resonance Theory (ART) [1,2,3] modules consisting of two
input fields $F_1^a$ (300) and $F_1^b$ (310) with an $F_2$ category field (320). For classification
problems, the $F_1^a$ field (300) serves as the input field containing the input activity vector
and the $F_1^b$ field (310) serves as the output field containing the output class vector. The
$F_2$ field (320) contains the activities of categories that are used to encode the patterns.
During learning, given an input pattern presented at the $F_1^a$ input layer and an output
pattern presented at the $F_1^b$ output field, an $F_2$ category node is selected to encode the
pattern pair.

When performing classification tasks, ARAM formulates recognition categories of input
patterns, and associates each category with its respective prediction. The knowledge that
ARAM discovers during learning is compatible with IF-THEN rule-based representation.
Specifically, each node in the $F_2$ field (320) represents a recognition category associating
the $F_1^a$ input patterns with the $F_1^b$ output vectors. Learned weight vectors, one for each $F_2$
node, constitute a set of rules that link antecedents to consequents. At any point during the
incremental learning process, the system architecture can be translated into a compact set
of rules. Similarly, domain knowledge in the form of IF-THEN rules can be inserted into
the ARAM architecture.

The ART modules used in ARAM can be ART 1 [1], which categorizes binary patterns,
or analog ART modules such as ART 2-A [2] and fuzzy ART [3], which categorize both
binary and analog patterns. The fuzzy ARAM model, which is composed of two overlapping
fuzzy ART modules, is described below.

Knowledge Acquisition Mode

Learning Sub-Mode

In the learning sub-mode of knowledge acquisition mode, ARAM learns a set of
recognition categories or rules by training from pre-labeled document sets. During
learning, the keyword frequency vectors, each representing a document, are presented to
ARAM as input A one at a time together with the associated class label input B.

Given an input keyword vector A, ARAM first searches for an $F_2$ recognition category
encoding a keyword template vector that is closest to the input vector according to some
similarity measure. It then checks if the associated $F_2$ output prediction of the selected
category matches the output class label B. If so, under fast learning, the keyword
template of the $F_2$ recognition category is modified to contain the intersection of the
original keyword template and the input keyword vector. Otherwise, the recognition
category is reset and the system repeats the search to select another category until a match
is found.
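
The fast-learning update described above can be sketched in Python, taking the "intersection" of two vectors to be the component-wise minimum (the fuzzy AND used by fuzzy ART). The function names are hypothetical and the template and input are assumed to have the same length.

```python
def fuzzy_and(p, q):
    """Component-wise minimum of two equal-length vectors."""
    return [min(x, y) for x, y in zip(p, q)]


def fast_learn(template, input_vector):
    """Replace a category template by its intersection with the input vector."""
    return fuzzy_and(template, input_vector)
```

Because each component of the updated template is the minimum of its old value and the input, a template can shrink but never grow under this update.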

Given a set of documents with a specific class label, the system learns to pick up
combinations of keywords that consistently appear in the documents and derives rules that
associate combinations of the relevant keywords to the target output class of the
documents. ARAM learning is stable in the sense that weight values do not oscillate, as
they can only decrease and never increase. As new cases are incorporated by adjusting the
weight templates of the chosen category nodes, learning does not wash away previously
learned knowledge. This allows incremental learning in the sense that learning of new
cases does not require relearning of old data.

The detailed algorithm for incremental learning is given below:

Input vectors: The $F_1^a$ and $F_1^b$ input vectors are normalized by complement coding that
preserves amplitude information. Complement coding represents both the on-response and
the off-response to an input vector a. The complement coded $F_1^a$ input vector A is a
2M-dimensional vector

$A = (a, a^c) = (a_1, \ldots, a_M, a_1^c, \ldots, a_M^c)$

where $a_i$ represents the normalized frequency score of keyword $w_i$, and $a_i^c = 1 - a_i$.

Similarly, the complement coded $F_1^b$ input vector B is a 2N-dimensional vector

$B = (b, b^c) = (b_1, \ldots, b_N, b_1^c, \ldots, b_N^c)$

where $b_k$ represents the presence ($b_k = 1$) or absence ($b_k = 0$) of a class label $c_k$, and
$b_k^c = 1 - b_k$.
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Activity vectors: Let x a and x b denote the F, 1 and F, b activity vectors respectively. Let y 
denote the F 2 activity vector. 
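
As an illustration of the complement coding of the input vectors described above, the following Python sketch builds the 2M-dimensional vector A from a keyword score vector a. The function name `complement_code` and the sample values are assumptions made for this example only and do not appear in the specification.

```python
def complement_code(a):
    """Return the complement-coded vector (a, a^c), where a_i^c = 1 - a_i."""
    if any(not 0.0 <= v <= 1.0 for v in a):
        raise ValueError("components must lie in [0, 1]")
    return list(a) + [1.0 - v for v in a]


A = complement_code([0.5, 0.0, 1.0])   # normalized scores for M = 3 keywords
B = complement_code([1.0, 0.0])        # class label c_1 present, N = 2 classes
```

Because each pair contributes $a_i + (1 - a_i) = 1$, the norm of a complement coded vector always equals the number of features, which is why the coding acts as a normalization.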

Weight vectors: Each $F_2$ category node j is associated with two adaptive weight templates
$w_j^a$ and $w_j^b$. Initially, all category nodes are uncommitted and all weights equal 1. After
a category node is selected for encoding, it becomes committed.

Fuzzy ARAM dynamics are determined by the choice parameters $\alpha_a > 0$ and $\alpha_b > 0$; the
learning rates $\beta_a \in [0,1]$ and $\beta_b \in [0,1]$; the vigilance parameters $\rho_a \in [0,1]$ and
$\rho_b \in [0,1]$; and a contribution parameter $\gamma \in [0,1]$.

Bottom up propagation: Given the $F_1^a$ input vector A, for each $F_2$ node j, the $F_1^a$ to $F_2$
input $T_j$ is defined by:

$$T_j = \frac{|A \wedge w_j^a|}{\alpha_a + |w_j^a|}$$

where the fuzzy AND operation $\wedge$ is defined by $(p \wedge q)_i = \min(p_i, q_i)$, and where the norm
$|\cdot|$ is defined by $|p| = \sum_i p_i$ for vectors p and q.
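
The choice function above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the specification: the helper names, the value `alpha_a = 0.1` and the three example templates are all assumed for the purpose of the example.

```python
def fuzzy_and(p, q):
    """Fuzzy AND: component-wise minimum, (p ^ q)_i = min(p_i, q_i)."""
    return [min(pi, qi) for pi, qi in zip(p, q)]


def norm(p):
    """Norm |p| = sum_i p_i (all components are non-negative)."""
    return sum(p)


def choice_values(A, weights_a, alpha_a=0.1):
    """Return T_j for every F2 node j; alpha_a = 0.1 is an assumed value."""
    return [norm(fuzzy_and(A, w)) / (alpha_a + norm(w)) for w in weights_a]


A = [1.0, 0.0, 0.0, 1.0]        # complement-coded input, M = 2
weights_a = [
    [1.0, 0.0, 0.0, 1.0],       # template identical to A
    [0.0, 1.0, 1.0, 0.0],       # template with no overlap with A
    [1.0, 1.0, 1.0, 1.0],       # uncommitted node (all weights equal 1)
]
T = choice_values(A, weights_a)
```

In this example the identical template scores highest, the uncommitted node scores lower because its larger norm appears in the denominator, and the non-overlapping template scores zero.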

Category choice: Using a choice rule, at most one $F_2$ node can become active. The choice
is indexed at J where $T_J = \max\{T_j : \text{for all } F_2 \text{ nodes } j\}$.

When a category choice is made at node J, $y_J = 1$, and $y_j = 0$ for all j not equal to J.

Resonance or reset: Resonance occurs if the match functions, $m_J^a$ and $m_J^b$, meet the
vigilance criteria in their respective modules:

$$m_J^a = \frac{|A \wedge w_J^a|}{|A|} \ge \rho_a \quad \text{and} \quad m_J^b = \frac{|B \wedge w_J^b|}{|B|} \ge \rho_b.$$

Learning then ensues, as defined below. If any of the vigilance constraints is violated,
mismatch reset occurs, in which the value of the choice function $T_J$ is set to 0 for the
duration of the input presentation. The search process repeats to select another new index
J until resonance is achieved.
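
The category choice, vigilance test and mismatch reset described above can be sketched as a search loop. The function names, the example templates and `alpha_a = 0.1` are assumptions. One deliberate deviation is labeled in the code: a reset node is marked with a negative sentinel rather than 0, so that it can never be re-selected ahead of an untested node whose genuine choice value is 0.

```python
def match(X, w):
    """Match function |X ^ w| / |X|."""
    return sum(min(x, wi) for x, wi in zip(X, w)) / sum(X)


def search(A, B, weights_a, weights_b, rho_a, rho_b, alpha_a=0.1):
    """Return the index J of the first resonating F2 node, or None if none resonates."""
    T = [sum(min(x, w) for x, w in zip(A, wa)) / (alpha_a + sum(wa))
         for wa in weights_a]
    for _ in range(len(T)):
        J = max(range(len(T)), key=lambda j: T[j])        # category choice
        if match(A, weights_a[J]) >= rho_a and match(B, weights_b[J]) >= rho_b:
            return J                                       # resonance
        T[J] = -1.0   # mismatch reset (assumed sentinel; the text sets T_J to 0)
    return None


A = [1.0, 0.0, 0.0, 1.0]
B = [1.0, 0.0, 0.0, 1.0]
weights_a = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
weights_b = [[0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]]   # node 0 predicts the other class
J = search(A, B, weights_a, weights_b, rho_a=0.0, rho_b=1.0)
```

Here node 0 wins the first choice but fails the output vigilance test, so it is reset and the uncommitted node 1 is selected instead.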



Learning: Once the search ends, the weight vectors $w_J^a$ and $w_J^b$ are updated according to
the equations

$$w_J^{a(\text{new})} = (1 - \beta_a)\, w_J^{a(\text{old})} + \beta_a \left(A \wedge w_J^{a(\text{old})}\right)$$

and

$$w_J^{b(\text{new})} = (1 - \beta_b)\, w_J^{b(\text{old})} + \beta_b \left(B \wedge w_J^{b(\text{old})}\right)$$

respectively. For efficient coding of noisy input sets, it is useful to set $\beta_a = \beta_b = 1$ when
J is an uncommitted node, and then take $\beta_a < 1$ and $\beta_b < 1$ after the category node is
committed. Fast learning corresponds to setting $\beta_a = \beta_b = 1$ for committed nodes.

Match tracking: At the start of each input presentation, the vigilance parameter $\rho_a$ equals a
baseline vigilance $\bar{\rho}_a$. If a reset occurs in the category field $F_2$, $\rho_a$ is increased until it is
slightly larger than the match function $m_J^a$. The search process then selects another $F_2$
node J under the revised vigilance criterion.
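
The template update and the match tracking step described above can be sketched as follows. The function names and the increment `epsilon` are assumptions; the specification only states that the vigilance is raised to slightly above the match value.

```python
def update_template(w_old, X, beta):
    """w_new = (1 - beta) * w_old + beta * (X ^ w_old), component-wise."""
    return [(1.0 - beta) * w + beta * min(x, w) for x, w in zip(X, w_old)]


def track_match(m_a, epsilon=1e-6):
    """Return a vigilance slightly above the match value m_a of the reset node."""
    return m_a + epsilon


w = [1.0, 1.0, 1.0, 1.0]                                # uncommitted node
w = update_template(w, [0.75, 0.25, 0.25, 0.75], 1.0)   # first input is copied
w = update_template(w, [0.5, 0.5, 0.5, 0.5], 1.0)       # fast learning: intersection
```

With a learning rate of 1 the template becomes the component-wise minimum of everything the node has encoded so far, which illustrates why weights can only decrease.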

Rule Insertion Sub-Mode 

Through the rule insertion process, user-defined knowledge in the form of rules is inserted
into the ARAM network (knowledge base). A rule is typically in the IF-THEN format,
consisting of a set of keyword features as the antecedents and a classification as the
consequent. Because the ARAM architecture is compatible with such rules, domain knowledge
in the form of IF-THEN rules can be readily inserted into an ARAM network.



For example, given a rule such as

"Stock", "Share", "Price" -> Business,

the rule insertion algorithm creates a keyword frequency vector in which the frequency
scores of "Stock", "Share" and "Price" are 1 and all others are 0, and an output class
vector in which the score of "Business" is 1 and all others are 0. Given the keyword
frequency vector in the $F_1^a$ field and the class vector in the $F_1^b$ field, ARAM first searches
for a recognition category that encodes exactly the same set of keywords. If such a
recognition category exists and its predicted class is "Business", no action is required as
the rule already exists. If the predicted class is not "Business", a contradiction occurs and
it is flagged to the user. If no such recognition category exists, a recognition
category is created to encode a keyword template consisting of "Stock", "Share", and
"Price" and a classification of "Business".

The detailed rule insertion algorithm is as follows: 

Rule insertion proceeds in two phases. The first phase translates each rule into a
2M-dimensional vector A and a 2N-dimensional vector B, where M is the total number of
document features and N is the number of classes.

In the most general case, given a rule of the following format,

IF $x_1, x_2, \ldots, x_m, \text{not}(x'_1), \text{not}(x'_2), \ldots, \text{not}(x'_{m'})$
THEN $y_1, y_2, \ldots, y_n, \text{not}(y'_1), \text{not}(y'_2), \ldots, \text{not}(y'_{n'})$

where $x_1, x_2, \ldots, x_m$ and $y_1, y_2, \ldots, y_n$ are positive attributes, and $x'_1, x'_2, \ldots, x'_{m'}$ and
$y'_1, y'_2, \ldots, y'_{n'}$ preceded by the logical NOT operator are negative attributes, the algorithm
derives the pair of vectors

$$A = (a, a^c) \quad \text{and} \quad B = (b, b^c)$$

such that for each index j = 1, ..., M,

$$(a_j, a_j^c) = \begin{cases} (1,0) & \text{if } w_j = x_i \text{ for some } i \in \{1, \ldots, m\} \\ (0,1) & \text{if } w_j = x'_i \text{ for some } i \in \{1, \ldots, m'\} \\ (0,0) & \text{otherwise} \end{cases}$$

and for each index k = 1, ..., N,

$$(b_k, b_k^c) = \begin{cases} (1,0) & \text{if } c_k = y_i \text{ for some } i \in \{1, \ldots, n\} \\ (0,1) & \text{if } c_k = y'_i \text{ for some } i \in \{1, \ldots, n'\} \\ (0,0) & \text{otherwise} \end{cases}$$

where $w_j$ is the j-th keyword feature and $c_k$ is the k-th class label.
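
The translation of a rule into the vector pair (A, B) can be sketched as below. The function name `encode_attributes`, the four-word vocabulary and the two class labels are hypothetical and chosen only to reproduce the "Stock", "Share", "Price" example; a real vocabulary would come from the feature extraction module.

```python
def encode_attributes(vocabulary, positive, negative):
    """Return (v, v^c): (1,0) for positive, (0,1) for negative, (0,0) otherwise."""
    pos, neg = set(positive), set(negative)
    if pos & neg:
        raise ValueError("an attribute cannot be both positive and negative")
    on = [1.0 if v in pos else 0.0 for v in vocabulary]
    off = [1.0 if v in neg else 0.0 for v in vocabulary]
    return on + off


keywords = ["stock", "share", "price", "football"]   # hypothetical vocabulary, M = 4
classes = ["Business", "Sports"]                     # hypothetical labels, N = 2

# "Stock", "Share", "Price" -> Business
A = encode_attributes(keywords, ["stock", "share", "price"], [])
B = encode_attributes(classes, ["Business"], [])

# IF stock, not(football) THEN Business, not(Sports)
A_neg = encode_attributes(keywords, ["stock"], ["football"])
B_neg = encode_attributes(classes, ["Business"], ["Sports"])
```

Unlike the complement coding used for documents, an attribute that the rule does not mention is encoded as (0,0), so the rule places no constraint on it.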

The vector pairs derived from the rules are then used as training patterns to initialize an
ARAM network. Given a pair of vectors A and B derived from a rule, their respective
recognition categories are associated through the map field.

During network initialization, the vigilance parameters $\rho_a$ and $\rho_b$ are each set to 1 to ensure
that only identical attribute vectors are grouped into one recognition category. Contradictory
symbolic rules are detected during rule insertion when identical input attribute vectors are
associated with distinct output attribute vectors. The detection is achieved through a perfect
mismatch phenomenon, in which the system tries to raise $\rho_a$ above 1 in response to a
mismatch in $F_1^a$.
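
The three possible outcomes of inserting a rule (new rule, existing rule, contradiction) can be illustrated with a simplified stand-in. The sketch below replaces the ARAM search at vigilance 1 with an exact-match dictionary lookup, which is an assumption made for clarity and is not the perfect mismatch mechanism itself; the function name and return strings are likewise hypothetical.

```python
def insert_rule(rule_base, A, B):
    """Store the pair (A, B) and report 'inserted', 'exists' or 'contradiction'."""
    key, value = tuple(A), tuple(B)
    if key not in rule_base:
        rule_base[key] = value          # create a new recognition category
        return "inserted"
    return "exists" if rule_base[key] == value else "contradiction"


rules = {}
A = [1.0, 0.0, 0.0, 0.0]
first = insert_rule(rules, A, [1.0, 0.0, 0.0, 0.0])
again = insert_rule(rules, A, [1.0, 0.0, 0.0, 0.0])
clash = insert_rule(rules, A, [0.0, 1.0, 0.0, 0.0])   # same antecedents, other class
```

A contradiction leaves the stored rule unchanged so that it can be flagged to the user rather than silently overwritten.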



WO 01/14992 PCT/SG99/00089 

Document classification 

Given an input document, a feature extraction module parses the text to derive a
normalized keyword frequency vector (as described above). The complement coded input
vector A is then presented to the $F_1^a$ field.

Given an input keyword vector A, ARAM first searches for an $F_2$ recognition category
encoding a keyword template vector that is closest to the input vector according to the
choice function. The associated $F_2$ output prediction of the selected category is then used
as the output class label.

Choice Rule: In ARAM systems with category choice, only the $F_2$ node J that receives
maximal $F_1^a$ to $F_2$ input $T_J$ predicts output. Specifically:

$$y_j = \begin{cases} 1 & \text{if } j = J \text{ where } T_J > T_k \text{ for all } k \ne J \\ 0 & \text{otherwise} \end{cases}$$

The $F_1^b$ activity vector $x^b$ is given by $x^b = \sum_j w_j^b y_j = w_J^b$, and the output vector
$B = (b_1, b_2, \ldots, b_N)$ is then read directly from the $F_1^b$ field. The output class is predicted to be K if
$b_K > b_k$ for all k not equal to K, and the confidence value is given by $b_K$.
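
A minimal sketch of classification under the choice rule follows. The function name `classify`, the two templates and `alpha_a = 0.1` are assumptions used only for illustration.

```python
def classify(A, weights_a, weights_b, n_classes, alpha_a=0.1):
    """Return (K, confidence) using winner-take-all category choice."""
    T = [sum(min(x, w) for x, w in zip(A, wa)) / (alpha_a + sum(wa))
         for wa in weights_a]
    J = max(range(len(T)), key=lambda j: T[j])      # y_J = 1, all other y_j = 0
    b = weights_b[J][:n_classes]                    # x^b = w_J^b; keep (b_1, ..., b_N)
    K = max(range(n_classes), key=lambda k: b[k])
    return K, b[K]


weights_a = [[1.0, 1.0, 0.0, 0.0, 0.0, 1.0],       # hypothetical template for class 0
             [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]]       # hypothetical template for class 1
weights_b = [[1.0, 0.0, 0.0, 1.0],
             [0.0, 1.0, 1.0, 0.0]]
A = [0.8, 0.6, 0.0, 0.2, 0.4, 1.0]                 # complement-coded input, M = 3
result = classify(A, weights_a, weights_b, n_classes=2)
```

Because the winning node's output template is binary in this example, the returned confidence is either 0 or 1.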

Confidence Value

Given training examples and rules of a single class output and with fast learning, ARAM
associates input features with a binary class prediction. In other words, only one output class
$b_K$ equals one and $b_k = 0$ for all k not equal to K. To derive a real-valued prediction score
between 0 and 1, a few strategies are possible, of which two are described below:

a) Distributed category prediction 

Using distributed category prediction, more than one $F_2$ node can become active. The $F_2$
output vector y represents a less extreme contrast enhancement of the $F_1^a$ to $F_2$ input T, in
the sense that the larger $T_j$ values are amplified and the smaller $T_j$ values are further reduced. Two
algorithms that approximate contrast enhancement are given below.

Power Rule: The power rule raises the input $T_j$ to the j-th $F_2$ node to a power p and
normalizes the total activity:

$$y_j = \frac{(T_j)^p}{\sum_k (T_k)^p}.$$

The power rule converges toward the choice rule as p becomes large.
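
The power rule can be sketched in a few lines of Python. The function name and the sample input values are assumptions; the sketch also assumes that at least one $T_j$ is positive, since otherwise the normalizing sum would be zero.

```python
def power_rule(T, p):
    """Return y_j = T_j**p / sum_k T_k**p (assumes some T_j > 0)."""
    powered = [t ** p for t in T]
    total = sum(powered)
    return [v / total for v in powered]


T = [0.5, 0.25, 0.25]          # hypothetical F2 inputs
y_linear = power_rule(T, 1)    # plain normalization
y_sharp = power_rule(T, 2)     # larger inputs gain relative weight
y_choice = power_rule(T, 50)   # close to winner-take-all
```

Increasing p moves the activity toward the node with the largest input, which is the sense in which the power rule converges toward the choice rule.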

K-max Rule: In the spirit of the K Nearest Neighbor (KNN) system, the K-max rule picks
the set of K $F_2$ nodes with the largest input $T_j$ for prediction. The $F_2$ activity values $y_j$ are
then:

$$y_j = \begin{cases} T_j \Big/ \sum_{k \in \pi} T_k & \text{if } j \in \pi \\ 0 & \text{otherwise,} \end{cases}$$

where $\pi$ is the set of K category nodes with the largest $T_j$ values. The K-max rule with
K equal to the total number of $F_2$ nodes is equivalent to the power rule with p = 1.
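
A short sketch of the K-max rule is given below. The function name and the sample inputs are assumptions, and ties between equal inputs are broken by node order, which the specification does not prescribe.

```python
def k_max_rule(T, K):
    """Normalize the K largest inputs among themselves; all other nodes get 0."""
    top = sorted(range(len(T)), key=lambda j: T[j], reverse=True)[:K]
    total = sum(T[j] for j in top)
    y = [0.0] * len(T)
    for j in top:
        y[j] = T[j] / total
    return y


T = [0.5, 0.25, 0.125, 0.125]   # hypothetical F2 inputs
y_top2 = k_max_rule(T, 2)       # only the two strongest nodes stay active
y_all = k_max_rule(T, len(T))   # every node active: plain normalization
```

When K covers every node, the result is the same plain normalization that the power rule gives with p = 1.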

After the $F_2$ activity vector y is contrast enhanced by the power rule or the K-max rule,
the output activity vector $x^b$ in the $F_1^b$ field is computed by

$$x^b = \sum_j w_j^b y_j.$$

The output vector $B = (b_1, b_2, \ldots, b_N)$ is then read directly from $x^b$. The output class is
predicted to be K if $b_K > b_k$ for all k not equal to K, and the confidence value is given by $b_K$.
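
The distributed read-out described above can be sketched as a weighted sum of output templates. The function name, the activity vector and the three templates are assumptions chosen so that the arithmetic is easy to follow.

```python
def distributed_prediction(y, weights_b, n_classes):
    """Compute x^b = sum_j y_j * w_j^b and return (K, confidence)."""
    width = len(weights_b[0])
    x_b = [sum(y[j] * weights_b[j][k] for j in range(len(y))) for k in range(width)]
    b = x_b[:n_classes]                      # (b_1, ..., b_N), dropping the complement half
    K = max(range(n_classes), key=lambda k: b[k])
    return K, b[K]


y = [0.5, 0.25, 0.25]                        # contrast-enhanced F2 activities
weights_b = [[1.0, 0.0, 0.0, 1.0],           # nodes 0 and 1 predict class 0
             [1.0, 0.0, 0.0, 1.0],
             [0.0, 1.0, 1.0, 0.0]]           # node 2 predicts class 1
result = distributed_prediction(y, weights_b, n_classes=2)
```

In this example two of the three active nodes support class 0, so the confidence is 0.75 rather than the binary value produced by the choice rule.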



b) Voting strategy 

Using the voting strategy, multiple ARAM systems are inserted with different
sets of rules and/or trained using different sets of input patterns or different orderings of
the input patterns. When performing classification, each ARAM votes for its predicted
class. The voting scores, normalized by the number of ARAM systems, provide a prediction score
between 0 and 1 for each output class:

$$s_j = \frac{V_j}{\text{number of ARAM systems}}$$

where $V_j$ is the number of votes given to output class j and $s_j$ is the normalized prediction
score for that class. The output class with the highest prediction score is the selected predicted
class and its prediction score is the confidence value.
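
The voting strategy can be sketched as follows, assuming each ARAM system has already produced a single predicted label. The function name, the class labels and the four example votes are hypothetical; ties are broken by the order of `classes`, which the specification does not prescribe.

```python
from collections import Counter


def vote(predictions, classes):
    """Return (winning class, confidence, all scores) from one vote per ARAM."""
    counts = Counter(predictions)
    n_systems = len(predictions)
    scores = {c: counts[c] / n_systems for c in classes}
    winner = max(classes, key=lambda c: scores[c])
    return winner, scores[winner], scores


classes = ["Business", "Sports"]
predictions = ["Business", "Business", "Sports", "Business"]   # four ARAM systems
winner, confidence, scores = vote(predictions, classes)
```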

Switching between modes 

The system administrator can switch between the classification mode and the knowledge
acquisition sub-modes by sending a message together with the appropriate data to the
document classifier. The message can be one of LEARN, INSERT, or CLASSIFY.
Depending on the message received, the document classifier adjusts the input baseline
vigilance parameter $\bar{\rho}_a$ and the output vigilance parameter $\rho_b$ of the ARAM classifier
accordingly and carries out the appropriate sequence of actions.

With a LEARN message, the document classifier receives a document text together with a
classification label. First, the feature extraction module derives the normalized keyword
frequency vector from the document. The keyword vector is presented as the input vector to
the $F_1^a$ field and the classification vector (based on the classification label) is presented to
the $F_1^b$ field of the ARAM classifier. The ARAM classifier is then run with $\bar{\rho}_a < 1$ (typically
0, to maximize generalization) and $\rho_b = 1$.

With an INSERT message, the document classifier receives an IF-THEN rule. First, the rule
insertion module converts the given rule into a pair of input and output vectors, presents the
input vector to the $F_1^a$ field and the output vector to the $F_1^b$ field. The ARAM classifier is
then run with both the input and output vigilance parameters set to 1.

With a CLASSIFY message, the document classifier receives a document text. First, the
feature extraction module derives the normalized keyword frequency vector from the
document and presents it as the input vector to the $F_1^a$ field. The ARAM classifier is then
run with both $\rho_a$ and $\rho_b$ equal to 0 to ensure that a prediction is always made. The predicted
classification label is then read from the $F_1^b$ field and returned to the user.

Figures 5, 6, and 7 summarize the parameter settings and the relevant functional blocks of the
document classification system in the learning, rule insertion, and document classification
modes respectively.
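
The vigilance settings associated with the three messages can be summarized in a small lookup table. The names `VIGILANCE` and `vigilance_for` are hypothetical, and the value 0 for LEARN is the typical setting mentioned in the description rather than a requirement.

```python
# (baseline input vigilance, output vigilance) for each message type
VIGILANCE = {
    "LEARN": (0.0, 1.0),      # baseline rho_a < 1 (typically 0), rho_b = 1
    "INSERT": (1.0, 1.0),     # both set to 1: only identical vectors are grouped
    "CLASSIFY": (0.0, 0.0),   # both set to 0: a prediction is always made
}


def vigilance_for(message):
    """Return (rho_a, rho_b) for a LEARN, INSERT or CLASSIFY message."""
    try:
        return VIGILANCE[message]
    except KeyError:
        raise ValueError(f"unknown message: {message!r}") from None


rho_a, rho_b = vigilance_for("LEARN")
```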

The embodiment described is not to be construed as limitative. For example, although the
classifier module has been shown implemented using an ARAM structure, it may be
implemented using any other structure which allows incremental learning and rule
insertion, such as an ARTMAP structure.

CLAIMS

1. Document classification apparatus comprising feature extraction means for 
extracting a plurality of features from a document and a classifier operable on the 
extracted features to process the document in a knowledge acquisition mode in 
which the association of a classification with the document is added incrementally 
to a knowledge base or in a document classification mode in which the classifier, 
using the knowledge base, determines a predicted classification for the document, 
the classifier being switchable between the modes under user control.

2. Apparatus as claimed in claim 1 wherein the classifier comprises a supervised 
adaptive resonance theory (ART) system. 

3. Apparatus as claimed in claim 2 wherein the system comprises an ARTMAP 
system. 

4. Apparatus as claimed in claim 2 wherein the system comprises an ARAM system. 

5. Apparatus as claimed in any one of the preceding claims further comprising a 
router arranged to route the document to one of a plurality of destinations in 
dependence upon the classification. 



6. Apparatus as claimed in any one of the preceding claims wherein the classification
has associated therewith a confidence value.

7. Apparatus as claimed in claim 6 as dependent on claim 5 wherein the confidence
value is comparable to a threshold, the router being arranged to make an automatic
routing or manual routing decision in dependence upon the comparison.



8. Apparatus as claimed in claim 7 wherein the threshold is adjustable. 



9. Apparatus as claimed in claim 7 or claim 8 wherein a said destination is a system 
administrator, responsible for manual routing. 



10. Apparatus as claimed in any one of the preceding claims wherein the features are
formed into a feature vector for input to the classifier.



11. Apparatus as claimed in any one of the preceding claims wherein the features 
comprise classification-associated words or phrases which may appear in the 
document. 



12. Apparatus as claimed in any one of the preceding claims wherein the extracting
means is arranged to provide a measure of the frequency of occurrence of the
features in the document.



13. Apparatus as claimed in claim 5 wherein the destinations include a system
administrator to which the other destinations are connected, mis-routed documents
being sendable by the other destinations to the system administrator for manual
routing.

14. Apparatus as claimed in claim 13 wherein the system administrator is connected to
the feature extraction means and classifier, the arrangement being such that a said
mis-directed document, in association with an actual classification supplied by the
system administrator, is processed in knowledge acquisition mode to add the
association of the actual classification with the mis-directed document to the
knowledge base.

15. Apparatus as claimed in any one of the preceding claims wherein the apparatus is 
operable to perform rule insertion in the knowledge acquisition mode in which a 
plurality of features are input by a user to the classifier together with a 
classification with which the features are associated. 

16. Apparatus as claimed in any one of the preceding claims wherein the apparatus is 
operable in knowledge acquisition mode to process a plurality of training 
documents with associated classifications as a batch. 

17. Document classification apparatus comprising:

feature extraction means for extracting a plurality of features from a document;
a classifier operable, using a knowledge base, to determine from the features a
predicted classification for the document, the classification having a confidence
value associated therewith; and
a router arranged to compare the confidence value to a threshold and make a
decision to route the document automatically to one of a plurality of destinations or
to a destination for manual routing in dependence upon the comparison.

18. Apparatus as claimed in claim 13 wherein the threshold is adjustable.